1 Introduction

This document describes how to use AvenioUpdate to update and explore data generated with the Avenio pipeline at AUH. The package depends on two files. The first is an R data file (.rds) which contains all the Avenio results for each sample. This file can be loaded into R, where the data can be inspected. The second is an Excel file (.xlsx) which contains the basic information about the samples run on the NGS machine. Both files are located at //Synology_m1/Synology_folder/AVENIO/ and are explained in much greater detail below.

As of late 2025 the package contains four main functions (add_run_to_list(), add_new_key(), extract_project() and create_simple_output()) which are used to update the results file and to get an overview of the results for a single patient. In addition, the package contains some smaller functions which are used to get statistics on the data we have collected, to remove entries from the results, and to make sure the results file is updated correctly.

2 Installation

This package is not published on CRAN or Bioconductor. However, it can still be installed from my GitHub page as demonstrated below:

# Only needs to be run once
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")

library(devtools)

# Only needs to be run once (Or when the package is updated)
devtools::install_github("CTrierMaansson/AvenioUpdate")

# Needs to be run every time you open RStudio
library(AvenioUpdate)

To check if the package has been installed correctly run the following command:

result_stats_Info()

Which should give an output like this:

#>               Name                                                                  Description
#> 1        Basestats                  Number of samples, runs, patients, different materials etc.
#> 2     Projectstats                               Number of patients and samples in each project
#> 3          Missing  Samples present in AVENIO_runs.xlsx but not present in the results data set
#> 4    All_mutations                      Number of times each gene is mutated across all samples
#> 5     Relevant_SNV         Number of SNVs detected not classified as BC or synonymous mutations
#> 6   Relevant_INDEL       Number of INDELs detected not classified as BC or synonymous mutations
#> 7     BC_in_plasma Number of times each gene is mutated in plasma but classified as BC mutation
#> 8  Fusions_project  Number of patients for each project with detectable EML4-ALK with DNAfusion
#> 9   Fusions_sample   Number of samples for each project with detectable EML4-ALK with DNAfusion
#> 10 Fusions_variant                                Number of patients with each EML4-ALK variant
#> 11      Fusions_NC                          Information on the non classified EML4-ALK variants
#> 12         Lengths                   Distribution of fragmentlengths across different materials
#> 13          Depths                     Distribution of unique depths across different materials
#> 14           Reads                      Distribution of mapped reads across different materials
#> 15       On_target                Distribution of on target percents across different materials

Sometimes installing devtools does not configure git correctly with R and you therefore have to install git manually.

If you use Windows, install git via https://git-scm.com/downloads/win. Use the standalone installer, 64-bit version. After you have installed git, test whether it has been installed correctly by opening the terminal: press the Windows key on your keyboard, type cmd and press enter. In the terminal type "git" and press enter. If git has been installed correctly you should see "usage: git [-v | --version] [-h | --help] ...". If you do not see this message, restart your computer and try again.

If you use Mac, install git via https://git-scm.com/download/mac and use Homebrew for the installation. Test whether git has been installed correctly as explained above, but open the terminal by pressing cmd + space and typing Terminal.
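The same check can also be done from within R. This is a minimal sketch using base R's Sys.which(), which returns an empty string when a program is not found on the PATH. The helper name git_available() is made up for this example and is not part of AvenioUpdate.

```r
# Hypothetical helper (not part of AvenioUpdate): TRUE if git is on the PATH
git_available <- function() {
  nzchar(Sys.which("git")[[1]])
}

if (git_available()) {
  message("git found at: ", Sys.which("git"))
} else {
  message("git not found; install it as described above and restart R")
}
```

If this reports that git is missing even after installation, restarting RStudio (or the computer) usually refreshes the PATH that R sees.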

2.1 Version

This manual was created using the following version of AvenioUpdate:

packageVersion("AvenioUpdate")
#> [1] '1.14.2'

Try running packageVersion("AvenioUpdate"). If you do not get the same result, run the installation command again:

devtools::install_github("CTrierMaansson/AvenioUpdate")
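If you prefer to let R decide whether a reinstall is needed, the comparison can be sketched as below. The helper needs_update() is hypothetical (not part of AvenioUpdate); it only uses base R's package_version(), which compares version numbers component by component.

```r
# Hypothetical helper (not part of AvenioUpdate): TRUE if the installed
# version is older than the version this manual was written for
needs_update <- function(installed, manual_version = "1.14.2") {
  package_version(installed) < package_version(manual_version)
}

needs_update("1.13.0")  # an older version than the manual
needs_update("1.14.2")  # the same version as the manual

# Typical use (requires AvenioUpdate to be installed):
# if (needs_update(as.character(packageVersion("AvenioUpdate")))) {
#   devtools::install_github("CTrierMaansson/AvenioUpdate")
# }
```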

3 How it works

As mentioned, the package depends on two files located on the Synology. This means that in order to use the package you must have access to the Synology folder. Instructions for obtaining access are found in /OneDrive/Lung cancer group ALABS/How-to guides. Once you have gained access to the Synology you must stay connected via an Ethernet cable connection with IPv4 address 10.124.6.78 (health.client.au) or IPv4 address 10.60.24.79 (onerm). Note that both the AU and the region networks can establish the connection to the Synology.

3.1 The AVENIO_runs.xlsx file

This file is located at //Synology_m1/Synology_folder/AVENIO/ and contains all the relevant information for all NGS samples across different projects. The file is used to connect CPR numbers and other relevant information with the NGS results from the Avenio output. When it is kept up to date, the file serves as a key file: it lets us tolerate typing errors, poorly designed sample names, non-unique sample names across different runs, etc. in the Avenio system while maintaining consistency across all our projects. Yes! Very neat!

However…

THIS FILE IS UPDATED MANUALLY

The file can be opened in Excel, where the samples can be added accordingly. It is VERY important that samples/entries are not deleted from the file. Christoffer makes sure there is a backup of the file on GenomeDK, which is updated regularly. If you make errors when entering your information, no worries, this can be fixed without any repercussions. When add_run_to_list() is executed to update the results, R runs several checks on the information in AVENIO_runs.xlsx. However, not all errors can be caught, so if you find an error in AVENIO_runs.xlsx after add_run_to_list() is done, the error has to be fixed with either remove_run() or remove_sample_index() (see below).

But most importantly: it is very hard to recover deleted data.

So as long as you don’t delete anything or change entries not belonging to you all is fine!

Nevertheless, there are some rules regarding the file:

  1. Do not delete anything you are not supposed to
  2. Do not delete anything you are not supposed to
  3. Never move the file from its directory //Synology_m1/Synology_folder/AVENIO/
  4. Do not make your own copy of the file and store it locally
  5. Format the information in the correct format (see below)
  6. Try to avoid adding samples with incomplete information
  7. Avoid the use of special characters like ”ÆØÅ½#$” etc. but “_” is okay, EXCEPT IN PROJECT AND NAME_IN_PROJECT!
  8. And do not delete anything you are not supposed to

3.1.1 Formatting entries in AVENIO_runs.xlsx

This is an example of some of the entries in the file. To ensure no CPR numbers from real patients are displayed, the CPRs here are randomly generated and are therefore "useless".

| CPR | Name_in_project | Project | Sample_date | Run_name | Run_ID | Sample_name | Sample_note | Material |
|---|---|---|---|---|---|---|---|---|
| 1401362941 | AW373401PDZHAA5W | MonAlec | 2024-06-28 | 20241030 | AKmgHQ2-w3RHTqVjtWTKLz60 | mona_0912742407_RS_2L | Tx | cfDNA |
| 1902514323 | pt19 | Batezo | 2021-03-15 | 20210415 | ABSYoqHZHR9Gvo_Z0scoIQ8w | 742prae | Unknown | cfDNA |
| 1902514323 | pt19 | Batezo | 2021-04-28 | 20210616 | ADNC-IlUdsROnLSz2AnnTpGz | BATEZ_2_1301410014 | Tx | cfDNA |
| 1902514323 | pt19 | Batezo | 2021-05-21 | 20210616 | ADNC-IlUdsROnLSz2AnnTpGz | BATEZ_PROG3_1301410014 | Tx | cfDNA |
| 1902514323 | pt19 | Batezo | 2021-03-15 | 20231016 | ALsopAc5KC1Bx7pJLiHZ-eJR | pt19bc | BC | BC |
| 2409291076 | 2900 | Josephine-MRD | 2013-04-22 | 20220408 | ASl7tgQWvuhA44Kx9otONfK0 | 2900post2 | Tx | cfDNA |
| 2803595505 | 3296 | Josephine-MRD | 2012-11-09 | 20220504 | AceQabkgtrVDvohQ_gcEx7Yf | 3296bc | BC | BC |
| 1809485424 | pt28 | cfChIP | 2017-12-27 | 20220204 | AY-MEcYGsnlEp70J19fZmwOJ | G-514-ChIP | Unknown | cfChIP |

3.1.2 Variables in AVENIO_runs.xlsx

This table briefly shows the general requirements for all variables in AVENIO_runs.xlsx. For details please see the descriptions below.

For a detailed description of how these variables are combined and used in the context of different types of samples and analyses, see the section on sample_index.

| Variable | Included in sample_index | Uniqueness | Formatting requirements | add_new_key() required | Is required | Can contain '_' |
|---|---|---|---|---|---|---|
| CPR | No | Within project | Patient specific | No | Yes | Yes |
| Name_in_project | Yes | Within project | Project specific | No | Yes | No |
| Project | Yes | None | Project specific | Yes | Yes | No |
| Sample_number | No | None | None | No | No | Yes |
| Sample_date | Yes | None | YYYY-MM-DD | No | Yes | No |
| Run_name | No | Run specific | None | No | No | Yes |
| Run_ID | No | Run specific | 24 characters | No | Yes | Yes |
| Sample_name | No | Within run | None | No | Yes | Yes |
| Sample_note | (Yes) | None | None | Yes | Yes | Yes |
| Material | (Yes) | None | None | Yes | Yes | Yes |

Sample_note and Material have (Yes) in the "Included in sample_index" column because whether they are included in sample_index depends on what is entered in these columns (see below).

3.1.2.1 CPR

The CPR number of the patient if such number is known/exists. This number is unique to every patient and is used to explore the results. The number is used to extract the information from each patient in create_simple_output().

As explained below, AVENIO_results_patients.rds is a named list of data.frames where each data.frame is named with the CPR number of the patient.

Therefore you will be able to explore all Avenio information, including sample metrics, using:

# Reading the results file
results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds")

# Extracting the data for a specific patient using the CPR number
results$`<CPR_number>` 

OR

# Reading the results file
results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds") 

# Extracting the data for a specific patient using the CPR number
results[["<CPR_number>"]]
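To try the two access styles without touching the real file, here is a small sketch with a toy list. The object toy_results and its contents are made up for illustration; only the structure (a named list of data.frames, named by CPR number) follows the description above.

```r
# Toy stand-in for AVENIO_results_patients.rds: a named list of data.frames.
# The CPR numbers and column values are made up for illustration.
toy_results <- list(
  "1401362941" = data.frame(Gene = c("EGFR", "MET"), stringsAsFactors = FALSE),
  "1902514323" = data.frame(Gene = "KIT", stringsAsFactors = FALSE)
)

names(toy_results)                    # all CPR numbers in the list
"1902514323" %in% names(toy_results)  # is this patient present?

# $ needs the CPR typed out in backticks, [[ ]] also accepts a variable
toy_results$`1902514323`
cpr <- "1401362941"
toy_results[[cpr]]
```

The [[ ]] form is the more convenient one when the CPR number is stored in a variable, for example when looping over several patients.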

3.1.2.2 Name_in_project

This is the name given to a patient in a project, such as pt26, 26, patient26 etc. This name must be unique for the patients within a specific project, but the same name can be used across different projects. Because of how the package works, "_" cannot be included in Name_in_project.

It is very important that for a given project a single CPR can only be assigned to a single Name_in_project. The same CPR can be included in different projects with different Name_in_project.

If any entry in the file contains "_" in Name_in_project, or multiple Name_in_project are assigned to the same CPR number, it will result in an ERROR.
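Both rules can be checked on your own rows before running the package. The following is only a sketch on a toy data.frame and not the check that AvenioUpdate runs internally. The column names follow AVENIO_runs.xlsx and the last row deliberately breaks both rules.

```r
# Toy subset of AVENIO_runs.xlsx; the last row deliberately breaks both rules
runs <- data.frame(
  CPR             = c("1902514323", "1902514323", "1902514323"),
  Project         = c("Batezo", "Batezo", "Batezo"),
  Name_in_project = c("pt19", "pt19", "pt_19"),
  stringsAsFactors = FALSE
)

# Rule 1: no "_" in Name_in_project
has_underscore <- grepl("_", runs$Name_in_project, fixed = TRUE)

# Rule 2: within a project, each CPR maps to exactly one Name_in_project
names_per_cpr <- tapply(
  runs$Name_in_project,
  paste(runs$Project, runs$CPR),
  function(x) length(unique(x))
)
ambiguous <- names(names_per_cpr)[names_per_cpr > 1]

runs[has_underscore, ]  # rows that violate rule 1
ambiguous               # "Project CPR" combinations that violate rule 2
```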

3.1.2.3 Project

This is the project the patient is part of. This is used to group patients as unique entries. It can also be used to quickly collect the patient information for your specific project by filtering on the project name. Because of how the package works, "_" cannot be included in Project.

If any entry in the file contains "_" in Project, it will result in an ERROR.

To include a new project run: add_new_key() and to see current included projects run: readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_keys.rds")

3.1.2.4 Sample_date

This is the date of the sample collection. It is important to format the date correctly. The date should be formatted as YYYY-MM-DD. It happens often that the sample dates for buffy coat (BC) and baseline (BL) samples are identical. This is not an issue as long as the Material column is also filled out correctly.

When the data is collected and analyzed, the algorithm checks the format of the date, and if it is not correct it will result in an ERROR.
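A date column can be pre-checked in R before the file is used. The sketch below is an assumption about what a reasonable check looks like, not the package's own code. The helper is_valid_sample_date() is hypothetical and combines a pattern check with as.Date() so that impossible dates are also rejected.

```r
# Hypothetical helper: TRUE for strings that are valid YYYY-MM-DD dates
is_valid_sample_date <- function(x) {
  # exactly four digits, two digits, two digits, separated by "-"
  looks_right <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  # the date must also exist in the calendar
  parses <- !is.na(as.Date(x, format = "%Y-%m-%d"))
  looks_right & parses
}

is_valid_sample_date(c("2021-03-15", "15-03-2021", "2021-3-15", "2021-02-30"))
```

Only the first of the four example strings passes: the second has the wrong order, the third lacks a leading zero, and the fourth is a date that does not exist.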

3.1.2.5 Sample_number

This is the number corresponding to the blood sample. This is not used by the code, but it can be useful to connect sample dates and sample numbers. If the number is unknown you can just write "Unknown".

3.1.2.6 Run_name

This is the name of the run as it appears in the Avenio system. While this name is not used specifically in the algorithm it is good to have the name in the file to have a record of which Run_name matches which Run_ID.

3.1.2.7 Run_ID

This is the ID of the run, which is randomly generated by the Avenio system. This is an important ID because it is the name of the folder where the results from Avenio are located on the Synology server. The Run_ID is always 24 characters long and starts with an "A". If the algorithm detects a Run_ID that is not 24 characters long it will result in an ERROR.
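The two properties of a Run_ID can be tested with a couple of lines of base R. This is a sketch with a hypothetical helper, is_valid_run_id(). Note that the manual only states that the length is checked by the package, so the check on the leading "A" is an extra precaution added here.

```r
# Hypothetical helper: TRUE if the ID is 24 characters long and starts with "A"
is_valid_run_id <- function(x) {
  nchar(x) == 24 & startsWith(x, "A")
}

is_valid_run_id(c(
  "AKmgHQ2-w3RHTqVjtWTKLz60",  # taken from the example table above
  "AKmgHQ2-w3RHTqVjtWTKLz6",   # one character too short
  "BKmgHQ2-w3RHTqVjtWTKLz60"   # does not start with "A"
))
```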

3.1.2.8 Sample_name

This is the name of the sample as it appears in the Avenio system. This name is only unique to the specific run and is used in the algorithm to extract the BAM files. The algorithm tests if the specific run contains a folder with the specific Sample_name and if it cannot find a folder with the specific Sample_name it will result in an ERROR.

3.1.2.9 Sample_note

This is a note that can be added to the sample, and it is not used in the algorithm. However, it is nice to include to get a quick overview of the samples that have been analyzed for the specific patient. In general the variable refers to the time of blood sampling in relation to treatment initiation, or to whether the sample is a BC sample. There is no limitation on what can be included in this column, although new possible entries have to be added with add_new_key().

To see current included Sample_notes run: readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_keys.rds")

3.1.2.10 Material

This is a mandatory column. The variable refers to the source of the material analyzed. In most cases this is “cfDNA” or “BC” which reflects purified plasma cfDNA or DNA from PBMCs, respectively. Other sources include “size_selection” or “cfChIP”. This is a mandatory column because it allows the same blood sample (same Project, Name_in_project, Sample_date) to be analyzed multiple times. If Material is not filled out the algorithm will result in an ERROR.

The classification of “BC” samples is important because non-BC samples from a patient have variants “flagged” according to the variants detected in the BC sample.

reanalyze has been added as a type of Material. This is used when a sample has been analyzed more than once. This can be used to discriminate between the two analyses which would otherwise have the same Project, Name_in_project, and Sample_date.
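To spot rows that would need a different Material (such as reanalyze) to be told apart, one can look for entries that share Project, Name_in_project, Sample_date and Material. The sketch below uses a toy data.frame and base R's duplicated(). Whether the package itself reports such clashes is not stated in this manual, so treat this purely as a manual pre-check.

```r
# Toy rows: the first and third are indistinguishable on the four key columns
runs <- data.frame(
  Project         = c("Batezo", "Batezo", "Batezo"),
  Name_in_project = c("pt19", "pt19", "pt19"),
  Sample_date     = c("2021-03-15", "2021-03-15", "2021-03-15"),
  Material        = c("cfDNA", "BC", "cfDNA"),
  stringsAsFactors = FALSE
)

key_cols <- c("Project", "Name_in_project", "Sample_date", "Material")

# Mark every row that has at least one twin (first and later occurrences)
clash <- duplicated(runs[key_cols]) | duplicated(runs[key_cols], fromLast = TRUE)

runs[clash, ]
```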

If the NGS has been executed using the tissue version of AVENIO, put in "tissue" in Material. This tells AvenioUpdate that the output format of the results .csv files is different from the plasma version of AVENIO. If it is a BC sample that has been analyzed with the tissue version of AVENIO, put in "BC" in Sample_note and "tissue" in Material. If it is a tumor sample that has been analyzed with the tissue AVENIO protocol, use "tumor_BL" or "tumor_Tx" in Sample_note.

There is no limitation on what can be included in this column, although new possible entries have to be added with add_new_key().

To see current included Materials run: readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_keys.rds")

3.2 The AVENIO_results_patients.rds file

This is the other important file located at //Synology_m1/Synology_folder/AVENIO/, however this file is NOT updated manually by opening and editing it. Instead, this file is updated automatically by the algorithm when add_run_to_list() is executed. This means, when you have updated the AVENIO_runs.xlsx file with new samples and you want to update the results, you run the following:

# Reading the results file
results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds") 

# Example path to the specific run
test_path <- "//Synology_m1/Synology_folder/AVENIO/AVENIO_results/Plasma-ANei_Hsi__lGmaJ1k_Ji_d3O" 

# Updating the results file
results <- add_run_to_list(master_list = results, 
                           Directory = test_path)

Then results will contain all of the information from the AVENIO_runs.xlsx file and AVENIO_results_patients.rds has been updated accordingly.

3.2.1 More technical note on AVENIO_results_patients.rds (Not mandatory read)

AVENIO_results_patients.rds is an .rds file which means it is a saved R object. The file therefore cannot be explored outside of R. However, the file can be loaded into R where the data is visible. The file is a named list of data.frames where each data.frame is named after the CPR number of the patient.

All the information from the Avenio system is stored in the data.frames and is extracted from the filtered variants located in the .csv files. If no variants have been detected for a sample, all of the other sample metrics such as the median depth, fragment length, base quality, etc. are still stored in the data.frames. This registers that the sample has been analyzed. In addition, if other samples with detectable variants have been analyzed for that patient, those variants are also investigated in the original BAM files for the patient.

3.3 sample_index

The variable sample_index is created for all samples when add_run_to_list() is executed. The essential idea behind this variable is to combine the most basic information about an analyzed sample and create a unique and consistent name for that sample. The sample_index naming is inspired by the naming strategy created by Lærke, which is explained in more detail on OneDrive at /OneDrive/Lung cancer group ALABS/How-to guides/How to navngiv patientprøver.pptx

The sample_index is created using the AVENIO_runs.xlsx file using the following variables:

  1. Project
  2. Name_in_project
  3. Sample_date
  4. Material
  5. Sample_note

And with some code which is run internally in AvenioUpdate:::create_sample_index(), the following decision tree is used to determine how the sample_index is formatted and contains the relevant information.

The figure shows how AvenioUpdate runs through the variables for each sample and based on the inputs it combines the variables in different formats (grey boxes). The reason behind the decision tree:

  1. Able to handle samples sequenced with plasma or tissue Avenio protocols
  2. Able to identify BC samples
  3. Able to combine results of two analyses of the same sample aka. reanalyze
  4. Able to add information regarding processing of the sample, e.g. size_selection

At the bottom of the figure I have given an example of how the sample_index can look for a patient in the FIOL cohort where multiple samples have been analyzed for that patient. Samples A to G are further explained in the figure below, which shows what inputs are used in the AVENIO_runs.xlsx file.

sample_index is exported as a variable in create_simple_output() as explained below to give the best sample context.
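To give a feel for the format, the sketch below rebuilds the two simplest cases seen in this manual, e.g. NARLAL_60_191113 and NARLAL_60_191113_BC. It is NOT the package's create_sample_index(): the function simple_sample_index() is hypothetical, it ignores tissue, reanalyze and size_selection samples, and it assumes a sample counts as BC when Material is "BC" (the real decision tree also looks at Sample_note).

```r
# Simplified sketch, NOT the package's create_sample_index(): only covers
# plain plasma samples and BC samples, inferred from examples in this manual
simple_sample_index <- function(project, name_in_project, sample_date, material) {
  # YYYY-MM-DD is shortened to YYMMDD
  date_part <- format(as.Date(sample_date, format = "%Y-%m-%d"), "%y%m%d")
  index <- paste(project, name_in_project, date_part, sep = "_")
  ifelse(material == "BC", paste0(index, "_BC"), index)
}

idx <- simple_sample_index(
  project         = c("NARLAL", "NARLAL"),
  name_in_project = c("60", "60"),
  sample_date     = c("2019-11-13", "2019-11-13"),
  material        = c("cfDNA", "BC")
)
idx

# Because "_" is the separator, the parts can be recovered by splitting
strsplit(idx, "_", fixed = TRUE)
```

The last line also illustrates why "_" is not allowed in Project and Name_in_project: an extra underscore would shift the parts when the index is split again.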

3.4 Synology path

The package depends on access to the Synology folder, which can be established by following the how-to guide at /OneDrive/Lung cancer group ALABS/How-to guides. At times, this connection is established through different mechanisms, and this can affect the path to the Synology folder. The correct (meaning default) path is //Synology_m1/Synology_folder/ and has been hard coded into all the functions. This is also the path which is used in this manual.

However, if the connection to the Synology folder can only be established through the IP address, the path could be //10.124.39.251/Synology_folder/, or any other path that has been created between your computer and the Synology (see above). In that case, use the synology_path argument to specify the exact connection which you have established to the Synology. If you do not have the default path, you should specify the Synology path every time you run functions, similar to the example below:

# defining path
syn_path <- "//10.124.39.251/Synology_folder/AVENIO/"

# Example of function needing the synology_path argument
add_run_to_list(...,
                synology_path = syn_path)

# '...' means "additional arguments"

And the results file can only be read using:

# Reading the Avenio results:
results <- readRDS("//10.124.39.251/Synology_folder/AVENIO/AVENIO_results_patients.rds") 
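To avoid typing the full file path by hand for each connection, the path can be built from the folder path. The helper results_file() below is hypothetical and not part of AvenioUpdate. It only pastes the folder and file name together and makes sure there is exactly one "/" between them.

```r
# Hypothetical helper: build the full path to the results file
results_file <- function(synology_path = "//Synology_m1/Synology_folder/AVENIO/") {
  # make sure the folder path ends with exactly one "/"
  synology_path <- sub("/*$", "/", synology_path)
  paste0(synology_path, "AVENIO_results_patients.rds")
}

results_file()
results_file("//10.124.39.251/Synology_folder/AVENIO")

# Reading only works when the Synology is reachable:
# syn_path <- "//10.124.39.251/Synology_folder/AVENIO/"
# if (file.exists(results_file(syn_path))) {
#   results <- readRDS(results_file(syn_path))
# }
```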

4 Main functions

4.1 add_run_to_list()

As mentioned, this is the most important function in the package. The function is used to update the AVENIO_results_patients.rds file with any new information from the AVENIO_runs.xlsx file.

4.1.1 How it works

How the function works is illustrated below:

Fig 1. Illustration of how add_run_to_list() works by merging information from different files

The function takes two mandatory arguments:

  1. master_list which is the AVENIO_results_patients.rds loaded in using readRDS()
  2. Directory which is the complete path to the specific run on the Synology server

And one optional argument:

  1. synology_path which is the path to the AVENIO_runs.xlsx and AVENIO_results_patients.rds files. (Default: "//Synology_m1/Synology_folder/AVENIO/")

As illustrated in the figure above, the function extracts the relevant information from the AVENIO_runs.xlsx file. Then it looks through the established results on all the patients (using the CPR numbers) and combines the new run with previous runs from patients in the new run. Then it takes all the identified mutations for each patient and looks for each mutation in all BAM files generated for the patient. After this, if a BC sample exists for the patient, all mutations identified in the non-BC samples are flagged as BC mutations if they are found in the BC sample as well. Finally, the results dataset is updated with the new runs, and existing runs are updated after the BAM files have been reanalyzed.

In addition to this, the newly added samples are also analyzed with DNAfusion, which is our most sensitive method for detecting EML4-ALK fusions. If a fusion is identified we try to classify the fusion variant using the classification from this publication

4.1.2 Flags and BC mutations

The Flags variable is included to give some context on the mutation. If the mutation is not found in the Avenio system, but only found when the BAM file is analyzed, the flag is set to "BAM". In that case, the MAF and variant depth are also determined from the BAM file, whereas the standard MAF and variant depth are determined from the Avenio system.

"DNAfusion" is shown in Flags if the EML4-ALK fusion is detected by DNAfusion. If the fusion is also detected in the Avenio system, that fusion is maintained in the output but is not indicated in Flags.

When add_run_to_list() is executed, the internal function AvenioUpdate:::renanalyze_samples() annotates the mutations found in non-BC samples as BC mutations if they are also found in a BC sample. Just a single mutant read in the BC sample is enough for AvenioUpdate to identify that specific mutation in the BC sample. Because of this we discriminate between certain and uncertain BC mutations, in order to filter variants in the non-BC samples based on the varying confidence of the mutations called in the BC sample:

As shown in the figure, certain BC mutations are mutations that have been identified in the BC sample by AVENIO OR in the BAM files with at least 3 mutant reads. These mutations are annotated with 'BC_mut' in the Flags variable.

Uncertain BC mutations are classified as 'uncertain_BC_mut' in the Flags variable. These mutations have not been identified by AVENIO in the BC sample but only in the BAM files AND the mutation is identified with fewer than 3 mutant reads in the BAM file of the BC sample.
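The rule for certain and uncertain BC mutations can be written out as a small function. This is a sketch of the rule as described in the text, not the code used inside AvenioUpdate. The function classify_bc_flag() and its argument names are made up for this example.

```r
# Sketch of the rule described above (hypothetical function, not package code)
# called_by_avenio_in_bc: was the mutation called by AVENIO in the BC sample?
# bc_mutant_reads: number of mutant reads found in the BAM file of the BC sample
classify_bc_flag <- function(called_by_avenio_in_bc, bc_mutant_reads) {
  ifelse(
    called_by_avenio_in_bc | bc_mutant_reads >= 3,
    "BC_mut",
    ifelse(bc_mutant_reads >= 1, "uncertain_BC_mut", NA_character_)
  )
}

classify_bc_flag(
  called_by_avenio_in_bc = c(TRUE, FALSE, FALSE, FALSE),
  bc_mutant_reads        = c(40, 5, 2, 0)
)
```

The four toy cases cover a mutation called by AVENIO, a BAM-only mutation with at least 3 reads, a BAM-only mutation with fewer than 3 reads, and a mutation that is absent from the BC sample (no BC flag, returned as NA here).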

4.1.3 Output

The output of the function is the updated results master_list (a named list of data.frames). This is automatically saved as the AVENIO_results_patients.rds file. After this, running

# Reading the results file
results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds") 

means that results now contains the Avenio results updated with the most recent samples.

4.1.3.1 Tissue vs. plasma NGS

The output of the function is essentially a collection of the data presented in the results .csv files created by AVENIO. HOWEVER, the tissue NGS and plasma NGS do not have the same variables in the output, and we wanted to collect results from both types of analyses in one dataset.

For example tissue NGS does not have a Plasma.Volume..mL. variable which is present in the plasma NGS output, and plasma NGS does not have a Sample.Primer variable which is used for the tissue NGS, etc.

Because of this, AvenioUpdate has to know if the sample being analyzed has been created with the tissue or plasma NGS protocol (see details in Material). This also means that the final output across all included individuals contains variables from both tissue and plasma NGS, even though only a few patients have been investigated with tissue NGS. The reason for this is that I have to concatenate all samples, and this is only possible if all rows have the same number of variables and the same variable names.

However, the tissue variables for samples analyzed with plasma NGS just contain NA and can therefore be completely ignored and vice versa for tissue NGS as demonstrated in the image below:

Tissue NGS samples also have a unique sample_index structure. So take a look in the sample_index section for more information.
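The idea of concatenating samples with different variables can be illustrated with a few lines of base R. This is only a sketch of the principle, not how AvenioUpdate does it internally. The helper bind_fill() is hypothetical and the values in the two toy data.frames are made up; only the variable names Plasma.Volume..mL. and Sample.Primer come from the text above.

```r
# Hypothetical helper: row-bind two data.frames with different columns,
# filling columns that are missing in one of them with NA
bind_fill <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  rbind(a[all_cols], b[all_cols])
}

# Toy rows with made-up values
plasma <- data.frame(Sample.ID = "60A", Plasma.Volume..mL. = 4,
                     stringsAsFactors = FALSE)
tissue <- data.frame(Sample.ID = "T1", Sample.Primer = "P7",
                     stringsAsFactors = FALSE)

combined <- bind_fill(plasma, tissue)
combined
```

In the combined data.frame the plasma row has NA in Sample.Primer and the tissue row has NA in Plasma.Volume..mL., which is the pattern described above.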

4.1.4 Example

An example of how the function is used is shown below:

# Reading the results file
results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds") 

# Example path to the specific run
test_path <- "//Synology_m1/Synology_folder/AVENIO/AVENIO_results/Plasma-ANei_Hsi__lGmaJ1k_Ji_d3O" 

# Updating the results file
results <- add_run_to_list(master_list = results,
                           Directory = test_path)

4.1.5 End messages

Several messages are printed during the execution of the function. However, for everyday use only the last few messages are of importance. An example of this could be:

#> 1
#> Before the dataset consisted of 223 individuals and 572 samples analyzed
#> 2
#> Now the dataset consists of 223 individuals and 586 samples analyzed
#> 3
#> The following projects have been updated with this many samples
#> # A tibble: 1 × 2
#>   Project     n
#>   <chr>   <dbl>
#> 1 MonAlec    14
#> 4
#> And the following samples have been added to the dataset
#>  [1] "MonAlec_83Z7BF2N4Y12L933_200723_BC" "MonAlec_YZ0PPZ1A5JLCB27C_220211_BC"
#>  [3] "MonAlec_K6YFO3FWXOFI99UZ_201123_BC" "MonAlec_CQHQ4QA5K295LU8D_190911_BC"
#>  [5] "MonAlec_B14Q10RS3BRU807A_210304_BC" "MonAlec_6KYRWVT7W2KW1U48_220208_BC"
#>  [7] "MonAlec_PQ9NESTYL82NBF9B_200813_BC" "MonAlec_UE2R8PRMAT4T3IXD_190611_BC"
#>  [9] "MonAlec_2UCZEDI4F35T6ILM_200728_BC" "MonAlec_IT8BJVO03C08FV04_191002_BC"
#> [11] "MonAlec_YBNHFHWVB0HBC3YY_200115_BC" "MonAlec_9Y6DEL9FY4T37CX7_201202_BC"
#> [13] "MonAlec_4OOA9FR46IB4BXBJ_191101_BC" "MonAlec_FDEDCZO3IR8TENC0_211119_BC"
#> 5
#> Saving updated list of patients

The numbered messages mean the following:

  1. The first message explains how many individuals and samples the dataset consisted of before the addition of the new run
  2. The second message explains how many individuals and samples the dataset consists of after the addition of the new run
  3. The third message explains which projects have been updated and how many samples have been added to each project
  4. The fourth message prints the sample_index values (see above) that have been added to the dataset
  5. The fifth message explains that the updated list of patients has been saved

It is good practice to look through these messages after add_run_to_list() has been executed, to make sure the data has been updated as expected.
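The numbers in the first two messages can also be recomputed by hand, for example to compare the dataset before and after an update. The sketch below assumes that every data.frame in the list has a sample_index column, which is how the results are described in this manual, and uses a toy list instead of the real file. The helper count_dataset() is hypothetical.

```r
# Hypothetical helper: count individuals and unique samples in a results list
# Assumes each data.frame has a sample_index column
count_dataset <- function(df_list) {
  n_samples <- sum(vapply(
    df_list,
    function(df) length(unique(df$sample_index)),
    integer(1)
  ))
  c(individuals = length(df_list), samples = n_samples)
}

# Toy list with made-up CPR numbers: 2 individuals, 3 samples in total
toy_results <- list(
  "1111111111" = data.frame(
    sample_index = c("NARLAL_60_191113", "NARLAL_60_191113", "NARLAL_60_191113_BC"),
    stringsAsFactors = FALSE
  ),
  "2222222222" = data.frame(
    sample_index = "Batezo_pt19_210315",
    stringsAsFactors = FALSE
  )
)

count_dataset(toy_results)
```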

4.2 add_new_key()

This is the second main function of the package and is used to add new possible entries to the Project, Sample_note, and Material columns in the AVENIO_runs.xlsx file. This step is included to ensure typos are spotted, so you don't accidentally assign e.g. a wrong project name to a sample.

The function takes two mandatory arguments:

  1. key which is the key name you want to add as possible entry
  2. variable which is the variable name in the AVENIO_runs.xlsx you want to add the key to.

And one optional argument:

  1. synology_path which is the path to the AVENIO_runs.xlsx and AVENIO_results_patients.rds files. (Default: "//Synology_m1/Synology_folder/AVENIO/")

Current included entries for Project, Sample_note, and Material can be viewed with:

readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_keys.rds")

4.2.1 Output

There is no output of this function, just a message telling you which variable has been updated with the new key.

4.2.2 Example

An example of how the function can be used:

add_new_key(key = "test",
            variable = "Project")

This should print the following messages:

#> Adding the new key: 'test' for the variable: 'Project'.
#> DONE!
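To see why registering keys catches typos, here is a sketch that compares the entries in a table against a list of allowed keys. The structure of the allowed list (one character vector per variable) is an assumption made for this example and may differ from how AVENIO_keys.rds is actually organized. The helper find_unknown_keys() is hypothetical.

```r
# Assumed structure for illustration only: one character vector per variable
allowed <- list(
  Project  = c("MonAlec", "Batezo"),
  Material = c("cfDNA", "BC")
)

# Hypothetical helper: entries in runs that are not among the allowed keys
find_unknown_keys <- function(runs, allowed) {
  unknown <- lapply(names(allowed), function(v) {
    setdiff(unique(runs[[v]]), allowed[[v]])
  })
  names(unknown) <- names(allowed)
  unknown[lengths(unknown) > 0]
}

# Toy rows: "Monalec" is a typo of "MonAlec"
runs <- data.frame(
  Project  = c("MonAlec", "Monalec"),
  Material = c("cfDNA", "cfDNA"),
  stringsAsFactors = FALSE
)

find_unknown_keys(runs, allowed)
```

An entry showing up here is either a typo to be corrected in AVENIO_runs.xlsx or a genuinely new key that should be registered with add_new_key().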

4.3 create_simple_output()

This is the third main function of the package and is used to get a quick overview of a specific patient. The function takes two mandatory arguments:

  1. df_list which is the AVENIO_results_patients.rds loaded in using readRDS()
  2. CPR_number which is the CPR number of the patient you want to investigate

And two optional arguments:

  1. synonymous which is a logical variable that determines if synonymous mutations should be included in the output. (Default: TRUE)
  2. synology_path which is the path to the AVENIO_runs.xlsx and AVENIO_results_patients.rds files. (Default: "//Synology_m1/Synology_folder/AVENIO/")

4.3.1 Output

The output is a list with two entries. The first entry is just the CPR number to understand which patient the information is gathered from. The second entry is a data.frame with the following 14 variables where each row is a gene mutation detected in the patient:

  1. sample_index - Unique for each sample and is explained above
  2. Class - Classification of the mutation (FUSION, INDEL, SNV)
  3. Gene - The gene where the mutation is located
  4. AA - The amino acid change
  5. Description - Description of the mutation e.g. missense
  6. Flags - Flags for the mutation (detailed below)
  7. MAF - The mutational allele fraction
  8. Variant_depth - The number of identified reads with the variant
  9. Unique_depth - The number of unique reads on the position of the variant
  10. Analysis - Name of the sequencing run in the Avenio system
  11. Sample.ID - Name of the sample in the Avenio system
  12. Sample.note - Note of the sample in the AVENIO_runs.xlsx file (BL, Tx, BC, Unknown)
  13. Material - Material of the sample in the AVENIO_runs.xlsx file (cfDNA, BC, size_selection, cfChIP)
  14. Notes - Any notes manually added in the AVENIO_runs.xlsx file

If no mutations are detected in a sample for a patient the run is still added but Gene, AA etc. variables just contain NA.

For now, the output is fixed to these specific variables. However, upon request I will look into ways to modify the output so that the many other variables in the Avenio .csv files can also be extracted in this simple format.

4.3.2 Example

An example of how the function is used is shown below. Again, the CPR number is not a real CPR number.

# Reading the results file
results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds") 

# Extracting the simple output
overview <- create_simple_output(df_list = results,
                                 CPR = "1401362941",
                                 synonymous = FALSE)

# CPR number
overview[[1]] 

# Data.frame with the 14 variables
overview[[2]] 

Which gives the following output:

#> [1] "1401362941"
sample_index         Class  Gene      AA                                               Description                          Flags        MAF    Variant_depth  Unique_depth  Analysis  Sample.ID     Sample_note  Material  Notes
NARLAL_60_191113     SNV    CDH18     p.Arg134His                                      Missense                             BC_mut       0.17%  17             9734          20221125  60A           BL           cfDNA     NA
NARLAL_60_191113     SNV    EGFR      p.Arg149Trp;p.Arg149Trp;p.Arg149Trp;p.Arg149Trp  Missense;Missense;Missense;Missense  BAM, BC_mut  0.05%  4              7820          20221125  60A           BL           cfDNA     NA
NARLAL_60_191113     SNV    ERBB2     p.Cys334Ser;p.Cys334Ser;p.Cys319Ser              Missense;Missense;Missense                        0.15%  13             8434          20221125  60A           BL           cfDNA     NA
NARLAL_60_191113     SNV    MET       p.Thr1010Ile;p.Thr992Ile                         Missense;Missense                                 0.48%  39             8186          20221125  60A           BL           cfDNA     NA
NARLAL_60_191113     SNV    PDZRN3    p.Leu650Gln;p.Leu367Gln                          Missense;Missense                    BAM, BC_mut  0.03%  3              10695         20221125  60A           BL           cfDNA     NA
NARLAL_60_191113     SNV    RET       p.Val706Met;p.Val706Met                          Missense;Missense                    BAM, BC_mut  0.02%  3              17920         20221125  60A           BL           cfDNA     NA
NARLAL_60_191113_BC  SNV    APC       p.Ser837*                                        Stop gained                                       1.68%  171            10203         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    BRCA2     p.Ile1633Phe                                     Missense                                          0.30%  26             8576          20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    CDH18     p.Arg134His                                      Missense                                          0.14%  18             13199         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    DCAF12L2  p.Leu288Met                                      Missense                                          0.16%  16             10151         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    EGFR      p.Arg149Trp;p.Arg149Trp;p.Arg149Trp;p.Arg149Trp  Missense;Missense;Missense;Missense               0.12%  12             9992          20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    GPR139    p.Gly224Glu                                      Missense                                          0.10%  13             12497         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    KIT       p.Lys807Asn                                      Missense                                          1.74%  240            13769         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    MKRN3     p.Ala51Val;p.Ala51Val                            Missense;Missense                                 0.26%  34             13031         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    NAV3      p.Gly589Cys;p.Gly589Cys                          Missense;Missense                                 0.16%  18             11207         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    NFE2L2    p.Asp29Tyr                                       Missense                                          0.17%  13             7642          20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    NYAP2     p.Ala307Thr                                      Missense                                          0.09%  12             13741         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    PDZRN3    p.Leu650Gln;p.Leu367Gln                          Missense;Missense                                 0.65%  92             14148         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    PIK3CG    p.His295Gln                                      Missense                                          1.19%  136            11420         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    RET       p.Val706Met;p.Val706Met                          Missense;Missense                                 0.07%  18             26821         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    USP29     p.Asp790His                                      Missense                                          0.79%  80             10070         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_191113_BC  SNV    WIPF1     p.Val152Gly;p.Val152Gly;p.Val152Gly              Missense;Missense;Missense                        0.11%  19             16588         20221208  narlal_60_BC  BC           BC        NA
NARLAL_60_200205     SNV    CDH18     p.Arg134His                                      Missense                             BC_mut       0.20%  10             5100          20221125  60B           Tx           cfDNA     NA
NARLAL_60_200205     SNV    NYAP2     p.Ala307Thr                                      Missense                             BAM, BC_mut  0.02%  1              4932          20221125  60B           Tx           cfDNA     NA
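
The simple output stores MAF as text (for example "0.17%") and marks some rows with flags such as BC_mut. As a rough sketch of how this could be post-processed, the block below uses a toy data.frame with the same column names as overview[[2]]. The values are illustrative, and whether empty flags are stored as "" or NA is an assumption on my part; the filter shown works for both.

```r
# Toy rows with the same column names as overview[[2]] (values are illustrative)
overview_df <- data.frame(
  sample_index = c("PROJECT_01_200101", "PROJECT_01_200101", "PROJECT_01_200101_BC"),
  Gene         = c("CDH18", "MET", "CDH18"),
  Flags        = c("BC_mut", "", ""),
  MAF          = c("0.17%", "0.48%", "0.14%"),
  stringsAsFactors = FALSE
)

# Convert the text "0.17%" to the number 0.17 (still expressed in percent)
overview_df$MAF_percent <- as.numeric(sub("%", "", overview_df$MAF, fixed = TRUE))

# Keep rows whose Flags do not contain BC_mut
# grepl() returns FALSE for NA, so rows with missing flags are kept as well
not_bc <- overview_df[!grepl("BC_mut", overview_df$Flags, fixed = TRUE), ]

print(overview_df$MAF_percent)
print(not_bc$Gene)
```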

4.4 extract_project()

This function is used to extract all the results in a project to be used for downstream analyses. Because the output is quite large some understanding of how to explore data.frames is required to fully explore the results.

The function takes two mandatory arguments:

  1. df_list which is the AVENIO_results_patients.rds loaded in using readRDS()
  2. project which is the project name from the AVENIO_runs.xlsx file

And three optional arguments:

  1. synonymous which is a logical variable that determines if synonymous mutations should be included in the output. (Default: TRUE)
  2. simple which is a logical variable that determines if the output should contain all information (simple = FALSE, default) or simple output (simple = TRUE) which resembles the output from create_simple_output()
  3. synology_path which is the path to the AVENIO_runs.xlsx and AVENIO_results_patients.rds files. (Default: “//Synology_m1/Synology_folder/AVENIO/”)

4.4.1 Output

A tibble where each row is a mutated gene, with either 15 or 89 columns. If simple = TRUE, the output contains 15 columns: the 14 variables explained under create_simple_output() plus one variable containing the patient CPR. If simple = FALSE (default), the output contains all the variables from the AVENIO .csv files coupled to the CPR numbers and other information from AVENIO_runs.xlsx.

4.4.2 Example

An example of how the function is used is shown below. Again, the CPR numbers are not real CPR numbers. The output is quite large (many rows), so I have only included 10 rows.

# Reading the results file
results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds") 

# Extracting project information
extract_project(df_list = results,
                project = "Pembrolizumab",
                synonymous = FALSE,
                simple = TRUE)

Which gives the following output:

CPR         sample_index                Class  Gene      AA                                                           Description                                      Flags   MAF    Variant_depth  Unique_depth  Analysis                 Sample.ID        Sample_note  Material  Notes
2801452016  Pembrolizumab_10_180912     SNV    SLC39A12  p.Gly453*;p.Gly453*                                          Stop gained;Stop gained                          BAM     0.05%  2              4346          20201217 monbazo olink1  olink_pt10_prae  BL           cfDNA     NA
2801452016  Pembrolizumab_10_181024     SNV    SLC39A12  p.Gly453*;p.Gly453*                                          Stop gained;Stop gained                                  0.16%  8              5073          20210205                 OLINK_PT10_BP3   Tx           cfDNA     NA
2109514436  Pembrolizumab_11_181024     CNV    EGFR      N/A                                                          N/A                                              N/A     N/A    N/A            N/A           20201217 monbazo olink1  olink_pt11_prae  BL           cfDNA     NA
2109514436  Pembrolizumab_11_181024     SNV    EYS       p.Cys57*;p.Cys57*;p.Cys57*                                   Stop gained;Stop gained;Stop gained                      4.30%  200            4647          20201217 monbazo olink1  olink_pt11_prae  BL           cfDNA     NA
2109514436  Pembrolizumab_11_181024     SNV    FBXL7     p.Val166Met                                                  Missense                                                 4.25%  168            3954          20201217 monbazo olink1  olink_pt11_prae  BL           cfDNA     NA
2109514436  Pembrolizumab_11_181024     SNV    GJA8      p.Val257Leu                                                  Missense                                         BC_mut  0.22%  12             5344          20201217 monbazo olink1  olink_pt11_prae  BL           cfDNA     NA
3008369755  Pembrolizumab_12_180629     SNV    LRRTM4    p.Arg198*;p.Arg199*;p.Arg198*;p.Arg199*                      Stop gained;Stop gained;Stop gained;Stop gained          0.21%  8              3797          20201217 monbazo olink1  olink_pt12_prae  BL           cfDNA     NA
3008369755  Pembrolizumab_12_180629     SNV    TP53      p.Ser215Ile;p.Ser215Ile;p.Ser215Ile;p.Ser215Ile;p.Ser204Ile  Missense;Missense;Missense;Missense;Missense             0.31%  9              2902          20201217 monbazo olink1  olink_pt12_prae  BL           cfDNA     NA
3008369755  Pembrolizumab_12_180629_BC  SNV    APC       p.Ser837*                                                    Stop gained                                              1.13%  11             976           20230927                 lunge1388_BC     BC           BC        NA
3008369755  Pembrolizumab_12_180817     SNV    LRRTM4    p.Arg198*;p.Arg199*;p.Arg198*;p.Arg199*                      Stop gained;Stop gained;Stop gained;Stop gained  BAM     0.12%  9              7329          20210216                 OLINK_PT12_BP3   Tx           cfDNA     NA

5 Small functions (Not mandatory read)

As explained above, I have created some functions which can be used to explore the datasets a bit more. These functions are not mandatory to use but can be useful in order to get an overview of the projects, samples, runs, etc.

5.1 included_analyses()

This function is used to get an overview of the runs from AVENIO_runs.xlsx that are included in the results dataset.

The function takes one mandatory argument:

  1. master_list which is the AVENIO_results_patients.rds loaded in using readRDS()

5.1.1 Output

The output is a named list of length 2. The first entry (“Overview:”) is a data.frame with four variables:

  1. Analysis.ID - The run ID from the AVENIO_runs.xlsx file
  2. n - Number of samples from that run that are included in the results dataset
  3. Analysis.Name - The name of the run in the Avenio system
  4. Analysis.Type - The analysis type of the run, e.g., Plasma

The second entry (“Details:”) is a named list of data.frames where each data.frame is named with the run ID. Each data.frame contains five variables:

  1. sample_index - Unique for each sample and is explained above
  2. Sample.ID - Name of the sample in the Avenio system
  3. Analysis.Name - The name of the run in the Avenio system
  4. Analysis.ID - The run ID from the AVENIO_runs.xlsx file
  5. Analysis.Type - The analysis type of the run, e.g., Plasma

Available entries can be shown using the following:

results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds")
analyses <- included_analyses(results)
names(analyses[["Details:"]])

OR

results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds")
analyses <- included_analyses(results)
analyses[["Overview:"]]$Analysis.ID
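
To illustrate how the two entries relate, here is a minimal sketch that uses a hand-built toy list with the structure described above instead of the real results. The run IDs (RUN_A, RUN_B), run names and sample names are made up for illustration.

```r
# Toy object with the structure described above (IDs and names are made up)
analyses <- list(
  "Overview:" = data.frame(
    Analysis.ID   = c("RUN_A", "RUN_B"),
    n             = c(2L, 1L),
    Analysis.Name = c("20230101", "20230202"),
    stringsAsFactors = FALSE
  ),
  "Details:" = list(
    RUN_A = data.frame(sample_index  = c("PROJECT_01_200101", "PROJECT_02_200101"),
                       Sample.ID     = c("1", "2"),
                       Analysis.Name = "20230101",
                       Analysis.ID   = "RUN_A",
                       stringsAsFactors = FALSE),
    RUN_B = data.frame(sample_index  = "PROJECT_03_200101",
                       Sample.ID     = "1",
                       Analysis.Name = "20230202",
                       Analysis.ID   = "RUN_B",
                       stringsAsFactors = FALSE)
  )
)

# Number of samples per run, computed from the "Details:" entry
samples_per_run <- vapply(analyses[["Details:"]], nrow, integer(1))

# The counts should agree with column n of the "Overview:" entry
overview <- analyses[["Overview:"]]
counts_agree <- all(samples_per_run[overview$Analysis.ID] == overview$n)

print(samples_per_run)
print(counts_agree)
```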

5.1.2 Example

An example of how the function is used is shown below:

# Reading the results file
results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds")

# Extract included analyses
analyses <- included_analyses(results)

# Getting an overview of the analyses
analyses[["Overview:"]]

Which should result in something like this:

Analysis.ID n Analysis.Name Analysis.Type
ADx0jDAXEJ9Ho7oP9cWlxCB- 16 20230818 Plasma
AdYl2rnCR1ZO5JLxgqP9fouS 16 20220531 Plasma
AE5hBlinjEZN6ZZ1L7898Rzc 1 20210204 Plasma
AEdiX1AaYJBFM4G7307KEA2u 16 FIOL6 Plasma
AEepGWqCaABI04g5DQESU97x 14 Copy_FaXb_20230112 Plasma
AEHqc78-RkhKwqARpiG9V9Qi 13 20211111 Plasma

And the second entry is viewed using:

# Reading the results file
results <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds")

# Extract included analyses
analyses <- included_analyses(results)

# Exploring specific run
analyses[["Details:"]][["AGU232uEYZZKe4oI95IrJu7o"]]

Which should result in something like this:

sample_index Sample.ID Analysis.Name Analysis.ID Analysis.Type
MonAlec_00QWTDYUH1E0VRM8_200217 4 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
MonAlec_SEAVIJ1HQS7PSXM1_200217 3 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
MonAlec_9XPREVQOLXK3JE4S_200304 1 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
MonAlec_9XPREVQOLXK3JE4S_200401 2 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Batezo_pt08_200316 9 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Batezo_pt11_200326 12 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Batezo_pt11_200414 13 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Batezo_pt06_200211 7 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Batezo_pt09_200320 14 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Batezo_pt09_200416 15 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
MonAlec_YBNHFHWVB0HBC3YY_200211 5 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Batezo_pt05_200204 6 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Batezo_pt07_200306 8 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Batezo_pt10_200325 10 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Batezo_pt10_200415 11 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma
Madsen_A10_200407 16 Copy_rceq_Monalec21042020 AGU232uEYZZKe4oI95IrJu7o Plasma

5.2 explore_AVENIO_runs()

This function is used to get an overview of the AVENIO_runs.xlsx file and explore the sample entries that have been entered. This can be used to get an overview of which samples are missing information, which analyses have not been added to the results dataset, etc.

The function takes NO mandatory arguments, but takes three optional arguments:

  1. Info - Name of the information of interest, if Info = NULL (default) all information is displayed
  2. silent - A Boolean determining if messages should be displayed. If silent = FALSE (default) messages are displayed
  3. synology_path which is the path to the AVENIO_runs.xlsx and AVENIO_results_patients.rds files. (Default: “//Synology_m1/Synology_folder/AVENIO/”)

5.2.1 Output

A list with different information on the AVENIO_runs.xlsx file (Info = NULL) or the specific object as determined by Info.

5.2.2 Example

Two examples of how the function is used are shown below:

# Exploring the AVENIO_runs.xlsx file
results <- explore_AVENIO_runs(silent = TRUE)

# Getting the information of total entries
results$Total_entries

Should return:

#> [1] 1999

And:

# Getting the information on the samples with all the required information
explore_AVENIO_runs(Info = "Required",
                    silent = TRUE)

Should return something like:

CPR Name_in_project Project Sample_date Run_name Run_ID Sample_name Material Sample_note
2405460036 pt01 Batezo 2019-10-25 Copy_mrAB_Monalec_batezo_1 AajqaHzI9_hCH4wYCbDAIYx4 12 cfDNA Unknown
2405460036 pt01 Batezo 2019-10-02 Copy_WQIe_Monalec_batezo_m_salic ARkgLARFX_lCQL3jQm9qG7IA 12 cfDNA Tx
2405460036 pt01 Batezo 2020-04-14 Copy_hjRD_202008212 AEjxrRrIVjJGzbUFckTwpTpp B6 cfDNA Tx
2405460036 pt01 Batezo 2019-10-25 Copy_supo_20230303 AGwQDzxcqrdDlpzlcS10s1En pippin_febr_Louise_1 size_selection Unknown
2405460036 pt01 Batezo 2019-10-25 20231004 AU-GJVH_R-RAXa0UUil1MCik pt01bc BC BC
1102610063 pt02 Batezo 2019-11-04 Monalec_Batezo_2 AYO0H3fNbPBEOKwJA9dv9IH0 6 cfDNA Unknown
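
A table like this can be summarised with base R. The sketch below uses a toy data.frame with some of the column names from the “Required” entry; the CPR numbers, project name and dates are made up for illustration.

```r
# Toy rows with column names from the "Required" entry (values are made up)
required <- data.frame(
  CPR         = c("0000000001", "0000000001", "0000000001", "0000000002"),
  Project     = c("PROJECT", "PROJECT", "PROJECT", "PROJECT"),
  Sample_date = c("2019-10-02", "2019-10-25", "2019-10-25", "2019-11-04"),
  Material    = c("cfDNA", "cfDNA", "BC", "cfDNA"),
  stringsAsFactors = FALSE
)

# Number of samples per patient, split by material
samples_by_material <- table(required$CPR, required$Material)

# Patients that have at least one BC sample registered
patients_with_bc <- unique(required$CPR[required$Material == "BC"])

print(samples_by_material)
print(patients_with_bc)
```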

5.2.3 Output description

The output of the function is a named list with different entries. I have included a function (explore_AVENIO_runs_Info()) which displays a brief explanation of each entry. The function has no mandatory arguments. To view the explanations run the following command:

# Getting explanations for entries in explore_AVENIO_runs()
explore_AVENIO_runs_Info()

Should return something like:

Name Description
Total_entries Total number of samples registered in Avenio_runs.xlsx
Complete_entries Number of samples with complete info and able to be included in AVENIO_results_patients.rds
Required Samples containing all the required information
Material_stats Number of samples with each designated type of material
Time_stats Number of samples with each designated sample type
Unincluded_analyses Runs entered in Avenio_runs.xlsx but is not present in AVENIO_results_patients.rds
Unincluded_CPRs CPRs entered in Avenio_runs.xlsx but is not present in AVENIO_results_patients.rds
Incomplete_IDs Incomplete Run IDs
Incomplete_dates Samples where dates are missing or wrongly formatted
Incomplete_names Samples where the sample name, project name or name in project is missing
Incomplete_material Samples where the material information is missing

5.3 result_stats()

This function is similar to explore_AVENIO_runs() but is used to get an overview of the results dataset. The function extracts information from the AVENIO_results_patients.rds file and generates some basic statistics on the dataset.

The function takes NO mandatory arguments, but takes three optional arguments:

  1. Info - Name of the information of interest, if Info = NULL (default) all information is displayed
  2. silent - A Boolean determining if messages should be displayed. If silent = FALSE (default) messages are displayed
  3. synology_path which is the path to the AVENIO_runs.xlsx and AVENIO_results_patients.rds files. (Default: “//Synology_m1/Synology_folder/AVENIO/”)

5.3.1 Output

A list with different stats on AVENIO_results_patients.rds (Info = NULL) or the specific object as determined by Info.

5.3.2 Example

Two examples of how the function is used are shown below:

# Exploring the results file
results <- result_stats(silent = TRUE)

# Getting the information on all mutations
results$All_mutations

Should return:

Gene n
EGFR 837
TP53 795
APC 398
KIT 362
PIK3CG 325
USP29 316
MET 311
PDZRN3 299
KRAS 272
BRCA2 263

And:

# Getting basic statistics on the NGS results
result_stats(silent = TRUE,
             Info = "Basestats")

Should return something like:

stat n
Patients 800
Samples 1999
Runs 150
Mutations 7999
cfDNA 1570
BC 303
tissue 59
size_selection 35
cfChIP 17
reanalyze 15
Tx 941
BL 538
Unknown 142
tumor_BL 50
Post-Tx 10
tumor_Tx 7

5.3.3 Output description

The output of the function is a named list with different entries. I have included a function (result_stats_Info()) which displays a brief explanation of each entry. The function has no mandatory arguments. To view the explanations run the following command:

# Getting explanations for entries in result_stats()
result_stats_Info()

Should return something like:

Name Description
Basestats Number of samples, runs, patients, different materials etc.
Projectstats Number of patients and samples in each project
Missing Samples present in AVENIO_runs.xlsx but not present in the results data set
All_mutations Number of times each gene is mutated across all samples
Relevant_SNV Number of SNVs detected not classified as BC or synonymous mutations
Relevant_INDEL Number of INDELs detected not classified as BC or synonymous mutations
BC_in_plasma Number of times each gene is mutated in plasma but classified as BC mutation
Fusions_project Number of patients for each project with detectable EML4-ALK with DNAfusion
Fusions_sample Number of samples for each project with detectable EML4-ALK with DNAfusion
Fusions_variant Number of patients with each EML4-ALK variant
Fusions_NC Information on the non classified EML4-ALK variants
Lengths Distribution of fragmentlengths across different materials
Depths Distribution of unique depths across different materials
Reads Distribution of mapped reads across different materials
On_target Distribution of on target percents across different materials

5.4 unlist_frames()

This function takes the list of data.frames containing all mutation information and combines them into a single data.frame, allowing downstream dplyr manipulation of all results. See the figure for how a list of data.frames is merged into a single data.frame.

The function takes one mandatory argument:

  1. df_list which is the AVENIO_results_patients.rds loaded in using readRDS()

5.4.1 Output

A data.frame where each row is a mutated gene, containing all the variables from the AVENIO .csv files. One additional variable, CPR, is added; it is extracted from the names of the input df_list.
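
The general idea of merging a named list of data.frames can be sketched in base R as shown below. This is only an illustration of the technique on toy data (made-up CPR numbers and sample names) and is not necessarily how unlist_frames() is implemented internally.

```r
# Toy list: one data.frame per patient, named by a made-up CPR number
df_list <- list(
  "0000000001" = data.frame(sample_index = c("PROJECT_01_200101", "PROJECT_01_200301"),
                            Gene         = c("EGFR", "TP53"),
                            stringsAsFactors = FALSE),
  "0000000002" = data.frame(sample_index = "PROJECT_02_200101",
                            Gene         = NA_character_,
                            stringsAsFactors = FALSE)
)

# Add the list name as a CPR column to each data.frame
with_cpr <- Map(function(df, cpr) data.frame(CPR = cpr, df, stringsAsFactors = FALSE),
                df_list, names(df_list))

# Stack all data.frames on top of each other and reset the row names
merged <- do.call(rbind, with_cpr)
rownames(merged) <- NULL

print(merged)
```

Note that rbind() requires all data.frames to have the same columns, which is the case for this toy list.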

5.4.2 Example

An example of how the function is used is shown below. Again, the CPR numbers are not real CPR numbers. The output is quite large (many rows), so I have only included 10 rows and selected specific variables of interest.

# Reading the results file
master <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds")

# Merging the results for all patients into a single data.frame
master_df <- unlist_frames(master)
print(master_df)

Which gives the following output:

CPR         sample_index           Analysis.Name       Analysis.ID               Mutation.Class  Gene
1909208904  IDA-MRD_A_180607       IDAMRD_2.0          AZR_EFZEsjxHLKcIz2BtCTIF  NA              NA
2709397795  PETERS_LBD_231114      Copy_GlAY_20231207  Aa4vMChZHTdFe76BAPB2R8Pl  INDEL           EGFR
2709397795  PETERS_LBD_231114      Copy_GlAY_20231207  Aa4vMChZHTdFe76BAPB2R8Pl  SNV             TP53
2709397795  PETERS_LBD_240801      20241030            AKmgHQ2-w3RHTqVjtWTKLz60  INDEL           EGFR
2709397795  PETERS_LBD_240801      20241030            AKmgHQ2-w3RHTqVjtWTKLz60  CNV             EGFR
2709397795  PETERS_LBD_240801      20241030            AKmgHQ2-w3RHTqVjtWTKLz60  SNV             EYS
3007652609  EVA-EGFR_58_160308     michelleFAST3       AceCNHniaeJHxYPppCoBA3IC  SNV             MET
1802857106  cfChIP_pt33_180625     20220531            AdYl2rnCR1ZO5JLxgqP9fouS  INDEL           APC
1802857106  cfChIP_pt33_180625_BC  20231020            AJD93ZaVEQxGcbVGcJdZZW94  INDEL           APC
1509995826  super_AUH2_200917      20201007 MonBazo    ARPd_ATt_oFBNrmQ7Yk_8YCo  SNV             EGFR
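
Once everything is in a single data.frame, simple summaries are easy to make. The sketch below uses base R rather than dplyr so that it runs without extra packages, and it works on a toy stand-in for master_df with made-up values.

```r
# Toy stand-in for the merged data.frame (values are made up)
master_df <- data.frame(
  CPR            = c("0000000001", "0000000001", "0000000002", "0000000003"),
  Mutation.Class = c("SNV", "INDEL", "SNV", NA),
  Gene           = c("EGFR", "TP53", "EGFR", NA),
  stringsAsFactors = FALSE
)

# Drop rows for samples without detected mutations (Gene is NA)
mutated <- master_df[!is.na(master_df$Gene), ]

# Count how often each gene occurs, most frequent first
gene_counts <- sort(table(mutated$Gene), decreasing = TRUE)

# Number of distinct patients with a mutation in each gene
patients_per_gene <- tapply(mutated$CPR, mutated$Gene,
                            function(x) length(unique(x)))

print(gene_counts)
print(patients_per_gene)
```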

5.5 remove_sample_index()

This function removes all results for a given sample_index. This is useful when there are mistakes in the results connected to the sample_index, or if the sample_index itself contains errors, e.g., a wrong Name_in_project or Sample_date.

The function takes two mandatory arguments (samples and master_list) and three optional arguments:

  1. samples - A Character or Character vector with sample_index to be removed
  2. master_list - Which is the AVENIO_results_patients.rds loaded in using readRDS()
  3. synology_path - Which is the path to the AVENIO_runs.xlsx and AVENIO_results_patients.rds files. (Default: “//Synology_m1/Synology_folder/AVENIO/”)
  4. save - A Boolean defining whether the results without samples should be saved (Default: FALSE)
  5. save_as - A Character defining the full path to where the output should be saved (Default: paste0(synology_path, "AVENIO_results_patients.rds"))

If save = FALSE, the user is prompted during execution to decide whether the results should be saved. Type 1 or 2 in the console to answer. Moreover, if the file defined by save_as already exists, the user has to confirm with 1 or 2 whether the file should be overwritten with the results returned by remove_sample_index(). These mechanisms have been implemented to minimize the risk of accidentally deleting content from AVENIO_results_patients.rds.

5.5.1 Output

The output is similar to the structure of the output from add_run_to_list() but without the specified sample_index

5.5.2 Example

master <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds")

rm_sample <- "PROJECT_PATIENT_260217"
master <- remove_sample_index(samples = rm_sample,
                              master_list = master,
                              save = TRUE)

This removes the results from the sample_index “PROJECT_PATIENT_260217” and saves the updated results as “//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds”
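
As an extra precaution before overwriting the results file, a timestamped copy can be made first. The block below is only a sketch of that idea and is not part of AvenioUpdate: it uses a temporary file as a stand-in for AVENIO_results_patients.rds so it is safe to run, and the backup naming scheme is my own assumption.

```r
# Stand-in for the results file; in practice this would be the path to
# AVENIO_results_patients.rds. A temporary file is used so the sketch is safe to run.
results_path <- tempfile(fileext = ".rds")
saveRDS(list(example = data.frame(x = 1)), results_path)

# Build a timestamped backup name next to the original file (hypothetical naming scheme)
backup_path <- sub("\\.rds$",
                   paste0("_backup_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".rds"),
                   results_path)

# Copy the file; overwrite = FALSE ensures an existing backup is never replaced
copied <- file.copy(results_path, backup_path, overwrite = FALSE)
if (!copied) stop("Backup failed, not safe to overwrite the results file")

# Read the backup back in to confirm it matches the original
backup_ok <- identical(readRDS(backup_path), readRDS(results_path))
print(backup_ok)
```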

5.6 remove_run()

This function removes all results for a given Run_ID. This is useful when there are multiple mistakes in the results connected to the run, or if several sample_index values contain errors, e.g., a wrong Name_in_project or Sample_date.

The Run_ID is the randomly generated ID created by Avenio when the sequencing is initiated. It is therefore the same type of ID that is used as Directory in add_run_to_list() (see above).

The function takes two mandatory arguments (run and master_list) and three optional arguments:

  1. run - A Character or Character vector with Run_ID to be removed
  2. master_list - Which is the AVENIO_results_patients.rds loaded in using readRDS()
  3. synology_path - Which is the path to the AVENIO_runs.xlsx and AVENIO_results_patients.rds files. (Default: “//Synology_m1/Synology_folder/AVENIO/”)
  4. save - A Boolean defining whether the results without the removed run(s) should be saved (Default: FALSE)
  5. save_as - A Character defining the full path to where the output should be saved (Default: paste0(synology_path, "AVENIO_results_patients.rds"))

If save = FALSE, the user is prompted during execution to decide whether the results should be saved. Type 1 or 2 in the console to answer. Moreover, if the file defined by save_as already exists, the user has to confirm with 1 or 2 whether the file should be overwritten with the results returned by remove_run(). These mechanisms have been implemented to minimize the risk of accidentally deleting content from AVENIO_results_patients.rds.

5.6.1 Output

The output is similar to the structure of the output from add_run_to_list() but without the specified Run_ID.

5.6.2 Example

master <- readRDS("//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds")

# Removing a single run from the results automatically
rm_run <- "EXAMPLE_RUN123"
remove_run(run = rm_run,
           master_list = master,
           save = TRUE)

This removes the results from the Run_ID “EXAMPLE_RUN123” and saves the updated results as “//Synology_m1/Synology_folder/AVENIO/AVENIO_results_patients.rds”

6 DNAfusion implementation (Not mandatory read)

As described above DNAfusion is used to analyze all patients for EML4-ALK gene fusions. When a run is added, the BAM files from that run are investigated with DNAfusion before the samples are combined with other sequencing runs from the same patients.

The following two functions from DNAfusion are used:

res <- DNAfusion::EML4_ALK_detection(file = "FILE.bam")
var_res <- DNAfusion::find_variants(file = "FILE.bam")

Based on the output of these two functions a new mutation is classified for that sample and added as a results row for that patient. The relevant variables that are based on the DNAfusion results include:

  1. Flags - Set as “DNAfusion” to illustrate it is detected by DNAfusion
  2. Mutation.Class - Set to “FUSION” similar to Avenio format
  3. Gene - Set to “ALK;EML4” similar to Avenio format
  4. Variant.Description - Classifies the EML4-ALK fusion variant if possible
  5. Allele.Fraction - The number of softclipped reads at the EML4 breakpoint divided by the coverage at the ALK breakpoint
  6. Genomic.Position - The breakpoint positions in ALK and EML4
  7. Variant.Depth - The number of softclipped reads at EML4 breakpoint
  8. Unique.Depth - The unique number of reads at the ALK breakpoint
  9. Exon.Number - Illustrates which intron the breakpoint occurs in for ALK and EML4
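
As a small worked example of the allele fraction described in the list above, the sketch below divides the number of softclipped reads by the read depth. The numbers are made up, and I assume here that the coverage at the ALK breakpoint corresponds to Unique.Depth.

```r
# Illustrative numbers only (not real results)
variant_depth <- 12    # softclipped reads at the EML4 breakpoint (Variant.Depth)
unique_depth  <- 4000  # unique reads at the ALK breakpoint (Unique.Depth)

# Fraction of fusion-supporting reads, assuming the coverage used is Unique.Depth
allele_fraction <- variant_depth / unique_depth

# Format as a percentage string with two decimals for readability
allele_percent <- sprintf("%.2f%%", 100 * allele_fraction)

print(allele_fraction)
print(allele_percent)
```

With these numbers the fraction is 12 / 4000 = 0.003, i.e. 0.30%.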

These results are also maintained when exploring the results with create_simple_output().

Upon exploration, it was identified that the variant Genomic.Position = "chr2:29223530;chr2:42295516" was detected in several samples but could not be classified. If this variant is detected, the fusion is classified as Variant.Description = "Uncertain_variant" and should be removed from downstream analyses.
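
Removing these fusions before downstream analyses could look like the sketch below. It runs on a toy data.frame that uses the variable names from the list above; the sample names and the first variant label (EXAMPLE_VARIANT) are placeholders made up for illustration.

```r
# Toy fusion rows (sample names and the first variant label are placeholders)
fusions <- data.frame(
  sample_index        = c("PROJECT_01_200101", "PROJECT_02_200101"),
  Mutation.Class      = c("FUSION", "FUSION"),
  Gene                = c("ALK;EML4", "ALK;EML4"),
  Variant.Description = c("EXAMPLE_VARIANT", "Uncertain_variant"),
  stringsAsFactors = FALSE
)

# Keep rows that are not classified as uncertain; rows where the
# description is missing (NA) are kept as well
keep <- is.na(fusions$Variant.Description) |
  fusions$Variant.Description != "Uncertain_variant"
fusions_clean <- fusions[keep, ]

print(fusions_clean$sample_index)
```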

7 list_rebuild.R AND AVENIO_runs.xlsx recovery (Not mandatory read)

On the GitHub page there is a file called list_rebuild.R, which should only be used if absolutely necessary. It is used to rebuild AVENIO_results_patients.rds if the file is lost or corrupted, or if an error in the dataset has been found and all samples need to be reanalyzed.

The file only works if the AVENIO_runs.xlsx file is not lost.

list_rebuild.R has to be updated manually alongside AVENIO_runs.xlsx, but Christoffer will make sure this happens relatively regularly.

When the entire script has been run, you should be able to investigate which runs (if any) are present in the AVENIO_runs.xlsx file but have not been added to the dataset, using:

explore_AVENIO_runs(silent = TRUE,
                    Info = "Unincluded_analyses")

You can then use add_run_to_list() to add the missing runs to the dataset.

In case the AVENIO_runs.xlsx file is lost, a backup of this file can be found on genomeDK at /faststorage/project/alabs_projects/Avenio_BAM/. This file is updated every month by Christoffer to ensure it is relatively up to date at all times.

8 Session info (Not mandatory read)

It is good practice to include this information in tutorials and manuals to ensure reproducibility and to help with troubleshooting if something does not work. If your code works, you can disregard this section.

#> Warning in system2("quarto", "-V", stdout = TRUE, env = paste0("TMPDIR=", :
#> running command '"quarto"
#> TMPDIR=C:/Users/chris/AppData/Local/Temp/RtmpKyN3Fp/file6be8786074e1 -V' had
#> status 1
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.5.0 (2025-04-11 ucrt)
#>  os       Windows 11 x64 (build 26200)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Danish_Denmark.utf8
#>  ctype    Danish_Denmark.utf8
#>  tz       Europe/Copenhagen
#>  date     2026-02-20
#>  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>  quarto   NA @ C:\\PROGRA~1\\RStudio\\RESOUR~1\\app\\bin\\quarto\\bin\\quarto.exe
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  AvenioUpdate * 1.14.2  2026-02-19 [1] Github (CTrierMaansson/AvenioUpdate@386f4b1)
#>  crayon       * 1.5.3   2024-06-20 [1] CRAN (R 4.5.0)
#>  dplyr        * 1.1.4   2023-11-17 [1] CRAN (R 4.5.0)
#> 
#>  [1] C:/Users/chris/AppData/Local/R/win-library/4.5
#>  [2] C:/Program Files/R/R-4.5.0/library
#>  * ── Packages attached to the search path.
#> 
#> ──────────────────────────────────────────────────────────────────────────────